Speech recognition

Speech recognition (SR) is the interdisciplinary subfield of computational linguistics that incorporates knowledge and research from linguistics, computer science, and electrical engineering to develop methodologies and technologies enabling the recognition and translation of spoken language into text by computers and computerized devices, such as those categorized as smart technologies and robotics. It is also known as "automatic speech recognition" (ASR), "computer speech recognition", or simply "speech to text" (STT).
Some SR systems use "training" (also called "enrollment"), where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Systems that do not use training are called "speaker independent"〔"Speaker Independent Connected Speech Recognition", Fifth Generation Computer Corporation〕 systems. Systems that use training are called "speaker dependent".
Speech recognition applications include voice user interfaces such as voice dialling (e.g. "Call home"), call routing (e.g. "I would like to make a collect call"), domotic appliance control, search (e.g. find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g. a radiology report), speech-to-text processing (e.g., word processors or emails), and aircraft (usually termed Direct Voice Input).
The term ''voice recognition''〔"British English definition of voice recognition"〕〔"voice recognition, definition of"〕〔"The Mailbag LG #114"〕 or ''speaker identification'' refers to identifying the speaker, rather than what they are saying. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person's voice, or it can be used to authenticate or verify the identity of a speaker as part of a security process.
From a technology perspective, speech recognition has a long history with several waves of major innovations. Most recently, the field has benefited from advances in deep learning and big data. The advances are evidenced not only by the surge of academic papers published in the field, but more importantly by the worldwide industry adoption of a variety of deep learning methods in designing and deploying speech recognition systems. These speech industry players include Microsoft, Google, IBM, Baidu (China), Apple, Amazon, Nuance, and IflyTek (China), many of which have publicized that the core technology in their speech recognition systems is based on deep learning.
==History==
As early as 1932, Bell Labs researchers like Harvey Fletcher were investigating the science of speech perception. In 1952 three Bell Labs researchers built a system for single-speaker digit recognition. Their system worked by locating the formants in the power spectrum of each utterance. The 1950s-era technology was limited to single-speaker systems with vocabularies of around ten words.
Unfortunately, funding at Bell Labs dried up for several years when, in 1969, the influential John Pierce wrote an open letter that was critical of speech recognition research. Pierce's letter compared speech recognition to "schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon." Pierce defunded speech recognition research at Bell Labs.
Raj Reddy was the first person to take on continuous speech recognition as a graduate student at Stanford University in the late 1960s. Previous systems required users to pause after each word. Reddy's system was designed to issue spoken commands for the game of chess. Also around this time, Soviet researchers invented the dynamic time warping (DTW) algorithm and used it to create a recognizer capable of operating on a 200-word vocabulary. Achieving speaker independence remained a major unsolved goal of researchers during this period.
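The idea behind dynamic time warping is to align two sequences that may vary in speed, so that, for example, a slowly spoken word can still be matched against a stored template of the same word spoken quickly. The following is a minimal Python sketch of the technique, not the Soviet researchers' original implementation; the one-dimensional features and the distance function are placeholders for whatever acoustic features a real recognizer would use.
```python
# Minimal dynamic time warping (DTW) sketch: aligns two sequences of
# acoustic features and returns the cost of the best alignment.
# A template matcher would compute this against every stored word template
# and pick the word with the lowest cost.

def dtw_distance(seq_a, seq_b, dist=lambda x, y: abs(x - y)):
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # cost[i][j] = cheapest alignment of seq_a[:i] with seq_b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            # allow a step that advances either sequence or both
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# Toy usage: the "utterance" is a stretched version of the template,
# so DTW still finds a low alignment cost despite the differing lengths.
template = [1.0, 2.0, 3.0, 2.0, 1.0]
utterance = [1.0, 1.1, 2.0, 2.9, 3.0, 2.1, 1.0]
print(dtw_distance(template, utterance))
```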
In 1971, DARPA funded five years of speech recognition research through its Speech Understanding Research program, with ambitious end goals including a minimum vocabulary size of 1,000 words. BBN, IBM, Carnegie Mellon and Stanford Research Institute all participated in the program. The government funding revived speech recognition research that had been largely abandoned in the United States after John Pierce's letter. Although CMU's Harpy system met the goals established at the outset of the program, many of the predictions turned out to be nothing more than hype, disappointing DARPA administrators. This disappointment led to DARPA not continuing the funding. Several innovations happened during this time, such as the invention of beam search for use in CMU's Harpy system.〔Lowerre, Bruce. "The Harpy Speech Recognition System", Ph.D. thesis, Carnegie Mellon University, 1976〕 The field also benefited from the discovery of several algorithms in other fields, such as linear predictive coding and cepstral analysis.
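Beam search, as used in Harpy, avoids exploring the full search space by keeping only the most promising partial hypotheses at each step. The sketch below is a generic Python illustration of that pruning idea, not the Harpy system itself; the expansion and scoring functions are toy placeholders where a recognizer would combine acoustic and language-model evidence.
```python
# Generic beam search sketch: at each step, expand every surviving
# hypothesis, score the extensions, and keep only the `beam_width` best.

def beam_search(initial, expand, score, steps, beam_width=3):
    beam = [initial]
    for _ in range(steps):
        candidates = []
        for hyp in beam:
            candidates.extend(expand(hyp))   # all one-step extensions
        # prune: keep only the highest-scoring hypotheses
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    return max(beam, key=score)

# Toy usage: hypotheses are strings over {"a", "b"}; the score simply
# counts occurrences of "ab", so the search favors alternating strings.
best = beam_search(
    initial="",
    expand=lambda h: [h + "a", h + "b"],
    score=lambda h: h.count("ab"),
    steps=6,
)
print(best)
```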
During the late 1960s Leonard Baum developed the mathematics of Markov chains at the Institute for Defense Analysis. At CMU, Raj Reddy's student James Baker and his wife Janet Baker began using the Hidden Markov Model (HMM) for speech recognition.〔http://ethw.org/First-Hand:The_Hidden_Markov_Model〕 James Baker had learned about HMMs from a summer job at the Institute for Defense Analysis during his undergraduate education.〔http://www.sarasinstitute.org/Audio/JimBaker(2006).mp3〕 The use of HMMs allowed researchers to combine different sources of knowledge, such as acoustics, language, and syntax, in a unified probabilistic model.
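In this HMM-based formulation, the different knowledge sources are combined in a single decision rule: the recognizer searches for the word sequence that is most probable given the observed acoustics, with the acoustic model and the language model entering as separate factors. A standard way of writing this rule (a general statement of the statistical approach, not a detail specific to the Bakers' system) is:
```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \arg\max_{W} \underbrace{P(X \mid W)}_{\text{acoustic model (HMM)}}\;
                       \underbrace{P(W)}_{\text{language model}}
```
Here X is the sequence of acoustic observations, W ranges over candidate word sequences, and P(X) can be dropped from the maximization because it does not depend on W.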
Under Fred Jelinek's lead, IBM created a voice-activated typewriter called Tangora, which could handle a 20,000-word vocabulary by the mid-1980s.〔http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/speechreco/〕 Jelinek's statistical approach put less emphasis on emulating the way the human brain processes and understands speech in favor of using statistical modeling techniques like HMMs. (Jelinek's group independently discovered the application of HMMs to speech.) This was controversial with linguists, since HMMs are too simplistic to account for many common features of human languages. However, the HMM proved to be a highly useful way of modeling speech, and it replaced dynamic time warping to become the dominant speech recognition algorithm in the 1980s.
IBM had a few competitors, including Dragon Systems, founded by James and Janet Baker in 1982.〔http://www.dragon-medical-transcription.com/history_speech_recognition.html〕 The 1980s also saw the introduction of the n-gram language model.
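An n-gram language model supplies the P(W) factor above by estimating the probability of each word from the preceding n-1 words, using counts from a text corpus. Below is a minimal bigram (n = 2) sketch in Python; the tiny corpus and the absence of smoothing are simplifications for illustration only.
```python
# Minimal bigram language model sketch: estimate P(word | previous word)
# from raw counts. Real systems use far larger corpora plus smoothing
# (e.g. back-off) so that unseen word pairs do not get zero probability.
from collections import Counter, defaultdict

corpus = "call home please call the office please call home".split()

bigram_counts = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    total = sum(bigram_counts[prev].values())
    if total == 0:
        return 0.0  # unseen history; a real model would back off/smooth
    return bigram_counts[prev][word] / total

# P("home" | "call") = 2/3 in this toy corpus
print(bigram_prob("call", "home"))
```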
Much of the progress in the field is owed to the rapidly increasing capabilities of computers. At the end of the DARPA program in 1976, the best computer available to researchers was the PDP-10 with 4 MB of RAM.〔 Using these computers, it could take up to 100 minutes to decode just 30 seconds of speech. A few decades later, researchers had access to tens of thousands of times as much computing power. As the technology advanced and computers got faster, researchers began tackling harder problems such as larger vocabularies, speaker independence, noisy environments and conversational speech. In particular, this shift to more difficult tasks has characterized DARPA funding of speech recognition since the 1980s. For example, progress was made on speaker independence first by training on a larger variety of speakers and then later by doing explicit speaker adaptation during decoding. Further reductions in word error rate came as researchers shifted acoustic models to be discriminative instead of using maximum likelihood models.
Another one of Raj Reddy's former students, Xuedong Huang, developed the Sphinx-II system at CMU. The Sphinx-II system was the first to do speaker-independent, large-vocabulary, continuous speech recognition, and it had the best performance in DARPA's 1992 evaluation. Huang went on to found the speech recognition group at Microsoft in 1993.
The 1990s saw the first introduction of commercially successful speech recognition technologies. By this point, the vocabulary of the typical commercial speech recognition system was larger than the average human vocabulary.〔 In 2000, Lernout & Hauspie acquired Dragon Systems and was an industry leader until an accounting scandal brought an end to the company in 2001. The L&H speech technology was bought by ScanSoft which became Nuance in 2005. Apple originally licensed software from Nuance to provide speech recognition capability to its digital assistant Siri.

Source: Wikipedia, the free encyclopedia.